perm filename REPORT[4,ALS] blob
sn#054414 filedate 1973-07-24 generic text, type T, neo UTF8
00010 Some Experiments on Speech Recognition
00100
00200 A serious attempt has been made to apply some machine learning
00300 techniques to the problem of speech recognition by machine.
00400 The general approach has been to develop techniques which will
00500 permit the computer system to adapt itself to the characteristics
00600 of the speaker by means of a training or learning procedure.
00700
00800 This training procedure might be used to adapt the system to the average
00900 characteristics of a large number of speakers or it could be used to
01000 adapt the system separately to each individual speaker, if the
01100 number of different speakers was not too large. An alternate
01200 arrangement might be to have a system that was already adapted to
01300 the expected characteristics of the speaker, sufficiently
01400 so, at least, that it could do a partial job of understanding
01500 and so that it could adapt itself to the speaker during the actual
01600 conversation, or failing in this, it could request the speaker to
01700 repeat some simple text that would enable the system to identify
01800 the particular differences in speaker characteristics that were the
01900 cause of misunderstandings.
02000
02100 While the early work here reported has been concentrated on the
02200 extraction and use of acoustic information from the speech,
02300 it should not be inferred that the methods are necessarily limited
02400 to this usage. Quite the contrary, the techniques can be equally well used
02500 to combine syntactic, semantic, linguistic, and cultural clues as to
02600 what is being said with the acoustic clues. The need for such
02700 additional information is by now well understood and early attempts
02800 at speech recognition were only marginally successful because of a
02900 failure to recognize this need.
03000
03100 Four quite different systems have been investigated in some detail.
03200 They all have one characteristic in common in that information
03300 is contained in tables, the so-called Signature Tables, as to
03400 the relationships between the acoustic clues contained in the
03500 speech and the desired phonetic (and ultimately linguistic) output.
03600 They differ in number and size of the tables that are used and in
03700 the way the tables are interconnected. They also differ in the rigour
03800 with which the Bayesian probabilities are computed. The earliest
03900 scheme made no attempt at rigour at all but did everything in the
04000 simplest possible way. Subsequent schemes made fewer approximations
04100 and employed somewhat greater table sizes. The most recent scheme
04200 seems to be nearly optimum in terms of the degree of rigour employed
04300 and the sizes of tables involved.
04400
04500 The program itself consists of a simple procedure for accumulating
04600 information as to the indicated relationships during training
04700 sequences in which known utterances with accompanying phonetic or
04800 linguistic translations are reviewed, and an even simpler procedure
04900 for performing the indicated translations on future unknown utterances.
05000 Only very minor changes have had to be made to the program itself to
05100 adapt it to quite different table schemes and once a fixed scheme is
05200 chosen no changes need be made to accommodate any desired arrangement
05300 of table interconnections, and of course no changes are required to
05400 adapt the tables to different speakers.
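
The two procedures described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the program reported here: the clue names, the table names ("A", "B", "top"), the layout format, and the rule that each table passes on its most frequently seen label are all hypothetical. The point it shows is that the arrangement of table interconnections can be held as data, so that the training and translation procedures stay the same when the arrangement changes.

```python
from collections import defaultdict

# Hypothetical layout (an assumption, not taken from the report): each table
# lists its inputs, which are either clue names or names of earlier tables.
# Tables must be listed in dependency order.
LAYOUT = {
    "A": ("energy", "zero_crossings"),
    "B": ("formant1", "formant2"),
    "top": ("A", "B"),
}


def new_tables(layout):
    """Create one empty count table per name: entry -> label -> count."""
    return {name: defaultdict(lambda: defaultdict(int)) for name in layout}


def _output(table, key):
    """Most frequently seen label at this entry, or None if never seen."""
    row = table.get(key)
    return max(row, key=row.get) if row else None


def _walk(tables, layout, clues, label=None):
    """Visit every table once; accumulate counts only when a label is given."""
    values = dict(clues)
    for name, inputs in layout.items():
        key = tuple(values[i] for i in inputs)
        if label is not None:
            tables[name][key][label] += 1
        values[name] = _output(tables[name], key)
    return values


def train(tables, layout, clues, label):
    """Training: review a known segment together with its translation."""
    _walk(tables, layout, clues, label)


def translate(tables, layout, clues):
    """Translation: output of the last table for an unknown segment."""
    last = list(layout)[-1]
    return _walk(tables, layout, clues)[last]


tables = new_tables(LAYOUT)
vowel = {"energy": 2, "zero_crossings": 0, "formant1": 1, "formant2": 3}
fricative = {"energy": 0, "zero_crossings": 3, "formant1": 0, "formant2": 0}
train(tables, LAYOUT, vowel, "AA")
train(tables, LAYOUT, fricative, "S")
print(translate(tables, LAYOUT, vowel))
```

Rearranging the tables in this sketch means editing LAYOUT only; adapting to a different speaker means starting from fresh tables and training again.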
05500
05600 Ideally one would like to combine all of the clues that are available
05700 for any particular segment of the speech in attempting to identify its
05800 meaning. This is quite impractical, however, both because of the large
05900 number of clues that are available (the number of dimensions of the
06000 clue space) and because of the range of values required to represent each
06100 clue. If the functional relationship between these variables were known
06200 for the particular speaker, one could make the necessary calculation.
06300 Unfortunately the relationships are not known analytically and since
06400 they vary from speaker to speaker it is impractical to attempt to
06500 learn them. Instead, a subset of the available clues is used to define
06600 an entry in a table and counts are accumulated in this table of the
06700 number of times utterances of particular types are accompanied by
06800 specific combinations of the chosen subset of clues. If the subset is
06900 well chosen and if enough samples have been examined then numbers
07000 can be computed which are in effect the Bayesian probabilities that
07100 future instances of these particular combinations of clues will
07200 predict intended meanings of the segments of the utterance.
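
The counting scheme just described can be illustrated with a single table. The sketch below makes several assumptions: the clue names, the choice of subset, and the use of plain relative frequencies (with no smoothing) as the estimated probabilities are illustrative and are not taken from the program reported here.

```python
from collections import Counter, defaultdict

# Hypothetical subset of the available clues used to address the table.
SUBSET = ("energy", "formant1")


class SignatureTable:
    """Counts of labels seen with each combination of the chosen clues."""

    def __init__(self, subset):
        self.subset = subset
        self.counts = defaultdict(Counter)

    def entry(self, clues):
        """Table entry addressed by the chosen subset; other clues are ignored."""
        return tuple(clues[name] for name in self.subset)

    def accumulate(self, clues, label):
        """Training step: count one occurrence of label at this entry."""
        self.counts[self.entry(clues)][label] += 1

    def probabilities(self, clues):
        """Relative frequency of each label at this entry (assumed estimate)."""
        row = self.counts.get(self.entry(clues))
        if not row:
            return {}
        total = sum(row.values())
        return {label: n / total for label, n in row.items()}


table = SignatureTable(SUBSET)
samples = [
    ({"energy": 2, "formant1": 1, "pitch": 5}, "AA"),
    ({"energy": 2, "formant1": 1, "pitch": 4}, "AA"),
    ({"energy": 2, "formant1": 1, "pitch": 6}, "AA"),
    ({"energy": 2, "formant1": 1, "pitch": 2}, "AH"),
]
for clues, label in samples:
    table.accumulate(clues, label)

print(table.probabilities({"energy": 2, "formant1": 1, "pitch": 0}))
```

With three samples labeled "AA" and one labeled "AH" at the same entry, the estimates are 0.75 and 0.25. An entry that was never seen in training yields no estimate at all, which is why enough samples must be examined before the numbers mean anything.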